Feat/medkit integration#1
Conversation
Declared in package.xml and CMakeLists.txt but never #included anywhere in src/, include/ or test/. Build is green without it; manymove_bringup keeps the dep where it is actually used.
All bt_client_*, collision_test and tutorials shared "bt_client_node", so source_id was indistinguishable across them. Each now matches its executable; the static logger in tree_helper.cpp and two log-level remaps in manymove_bringup launch files moved with them.
No CMake option, no conditional <depend>: this fork exists for the medkit integration. If we ever upstream to pastoriomarco/manymove, put it behind a toggle so vanilla users do not pull medkit transitively.
…tall helper Three pieces for BT action node instrumentation: fault_codes.hpp (MANYMOVE_* constants), fault_reporting.hpp (the FaultReporting capability class), and installFaultReporter in main_imports_helper.hpp wired into all 15 BT entry-point mains. Capability class instead of direct FaultReporter member because BT.CPP factory constructs nodes with fixed signatures, so we pull the reporter from the blackboard in the FaultReporting ctor (same pattern as `node_` already fetched in every action node).
Adds FaultReporting as a second base class to every Action node across planner/objects/signals/gripper/logic/isaac, plus the matching ctor init list call. Condition and decorator nodes are intentionally left out: their FAILURE return is normal control flow.
…austion Four reportFault sites in MoveManipulatorAction: - kPlannerCollisionDetected on the onStart collision branch - kPlannerEstopTriggered when stop_execution is set before motion - kPlannerRetryAttempt on every failed attempt (soft fault, throttled locally by LocalFilter) - kPlannerRetriesExhausted once max_tries is reached reportFaultPassed pairs kPlannerRetryAttempt in the success branch so the local filter resets when a retry eventually succeeds. Skipped the input/blackboard-missing FAILURE returns (lines 96, 147, 174 and ResetTrajectories@356); those are programmer errors caught at startup, not operational faults.
…ait timeout Five reportFault sites covering SetOutput/GetInput action failures, CheckRobotState NOT_READY (CRITICAL), WaitForInput server-unavailable and the wait-timeout path. reportFaultPassed pairs kRobotNotReady on CheckRobotState success and kSignalWaitInputTimeout on the WaitForInput desired-value branch so LocalFilter resets cleanly.
…timeout Five reportFault sites cover the operational failures of AddCollisionObjectAction, RemoveCollisionObjectAction, AttachDetachObjectAction, GetObjectPoseAction and WaitForObjectAction. WaitForObject pairs reportFaultPassed in the success branch. CheckObjectExistsAction is intentionally not instrumented: its FAILURE on \"object missing\" is normal control flow used by callers as a condition, not an operational fault.
Four reportFault sites: GripperCommand server timeout and goal-not-reached, GripperTraj server unavailable and trajectory-failed result. PublishJointStateAction throws on bad inputs and never returns FAILURE, so it has nothing to instrument.
GetLinkPoseAction reports kTfLookupFailed when tf2 raises a TransformException. WaitForKeyBool reports kWaitKeyTimeout on the timeout branch and pairs reportFaultPassed in the success path. Adds kWaitKeyTimeout to fault_codes.hpp. Configuration-error FAILUREs (missing inputs, wrong vector sizes) are left out, as are condition-style nodes (CheckPoseDistance, CheckPoseBounds, CheckKeyBoolValue) where FAILURE is expected control flow.
Two reportFault sites in FoundationPoseAlignmentNode: detection topic silent for too long, and no detection passing the target/min_score filter within the timeout. Both raise kIsaacFoundationPoseFailed. GetEntityPoseNode/SetEntityPoseNode service-error paths and Isaac TF timeouts are left for the isaac demo work in Phase 3, where they'll have specific fault codes once the demo wiring exposes them.
…ager fake_fault_manager.hpp is a small fixture that hosts the /fault_manager/report_fault service and records every accepted request, matching the existing FakeXxxServer pattern in this test directory. test_fault_reporting.cpp uses it to verify the end-to-end plumbing: - installFaultReporter puts a reporter on the blackboard - FaultReporting::reportFault forwards FAILED events with the right fault_code, severity, description and source_id (FQN of the BT client node) - reportFaultPassed forwards a PASSED event for healing - mixin silently no-ops when no reporter is installed (e.g. unit tests building a tree without bt_client scaffolding) Per-instrumentation-site assertions are out of scope here; those would need to drive every BT action node through real action server fakes and belong with the demo work in Phase 3.
- docs/FAULT_CODES.md: full table of MANYMOVE_* codes with severity, trigger condition and pairing semantics, grouped by subsystem. - README.md: short note pointing fork users at the medkit dependency and the catalogue. - CHANGELOG.rst: entry under Forthcoming summarising the BT node rename, the hard medkit dep and the dropped topic_based_ros2_control declaration. - ament_uncrustify --reformat run across the modified manymove_cpp_trees sources. - Lint fixes for new files (cpplint header order / memory include / guard style; ament_copyright BSD-3 boilerplate match). The remaining test_action_nodes_gripper failure is a pre-existing convertFromString registration issue in upstream manymove (verified by running the test from the parent commit before any medkit changes).
After wiring ros2_medkit_fault_reporter into manymove_cpp_trees the CI job needs the medkit packages on the runner. Pulling them via apt from rosdistro keeps the workflow simple (no source overlay clone).
- Wire kRobotResetFailed across ResetRobotStateAction failure branches (goal rejection + non-SUCCEEDED result for reset, unload-traj and load-traj steps). - Instrument Isaac TF transform timeouts with kIsaacFoundationPoseFailed to match the sibling no-message and no-valid-detection timeouts. - Replace silent FAILURE in GripperTrajAction::onRunning !goal_sent_ with BT::RuntimeError, matching the programmer-error policy. - Add reportFaultPassed on onHalted for WaitForInputAction and WaitForObjectAction so abandoned waits do not leave kSignalWaitInputTimeout / kObjectWaitTimeout lingering in FaultManager. - Drop orphan codes kPlannerMissingMoveId (programmer-error path uses throw) and kObjectExistsCheckFailed (CheckObjectExists is intentionally not instrumented; FAILURE is normal control flow there). - Pin medkit apt packages to MEDKIT_VERSION=0.4.0-1 so an upstream bump does not silently change CI behaviour. - Drop humble from the CI matrix until ros-humble-ros2-medkit-* debs appear on packages.ros.org (rosdistro entry exists; debs do not yet). - Update docs/FAULT_CODES.md: add kRobotResetFailed, expand the Isaac FoundationPose entry to cover TF timeouts, and document the reportFaultPassed-on-halt pairing for both wait codes.
bburda
left a comment
There was a problem hiding this comment.
The integration uses the FaultReporter API correctly and the FaultReporting mixin is a clean abstraction. Inline comments cover a few real concerns: an unthrottled fault flood in ResetRobotStateAction, fault codes that never heal (collision, e-stop, retry on halt), two sites that emit unconditional PASSED events, an unused mixin inheritance on two BT nodes, and a missing override on installFaultReporter. The last point also touches an upstream contradiction between the report_passed docstring and LocalFilter::should_forward_passed behavior that affects the reliability of the pairing pattern used throughout this PR.
| name().c_str()); | ||
| action_result_.success = false; | ||
| result_received_ = true; | ||
| reportFault( |
There was a problem hiding this comment.
ResetRobotStateAction calls reportFault(kRobotResetFailed, kSeverityError, ...) from 8 separate callback paths (rejection + result-failure across unload-trajectory, reset-robot-state, and load-trajectory). All sites use kSeverityError, which equals the default bypass_severity=2 and therefore bypasses LocalFilter entirely.
A single failed reset can drive up to 6 ReportFault.srv calls into the FaultManager. Because each FAILED event decrements the AUTOSAR-DEM debounce counter, one logical incident pushes the fault aggressively past confirmation_threshold and biases the lifecycle.
Guard with a per-run bool fault_reported_ set on first emission and cleared in onStart, or downgrade these sites to kSeverityWarn so LocalFilter coalesces them within its window.
| config().blackboard->set(robot_prefix_ + "collision_detected", false); | ||
| config().blackboard->set(robot_prefix_ + "stop_execution", true); | ||
|
|
||
| reportFault( |
There was a problem hiding this comment.
Three planner faults never emit a paired reportFaultPassed, leaving them CONFIRMED in the FaultManager after the underlying condition clears on hardware:
kPlannerCollisionDetected(this line) - never healed.kPlannerEstopTriggered(L137) - never healed even after operator releases the e-stop.kPlannerRetryAttempt- healed on theonRunningsuccess path (L231) but not inonHalted(L280), so a halted retry loop leaves a lingering soft fault. Compare withWaitForInputAction::onHaltedandWaitForObjectAction::onHaltedwhich do emit PASSED on halt.
Emit reportFaultPassed(kPlannerCollisionDetected) / reportFaultPassed(kPlannerEstopTriggered) on the early path of onStart when the corresponding flag reads false, and add reportFaultPassed(kPlannerRetryAttempt) in onHalted.
|
|
||
| RCLCPP_INFO(node_->get_logger(), "[MoveManipulatorAction] success => returning SUCCESS"); | ||
| // Heal the per-attempt soft fault if any retry was reported earlier. | ||
| reportFaultPassed(fault_codes::kPlannerRetryAttempt); |
There was a problem hiding this comment.
Two issues at this site:
-
The PASSED is unconditional. On the first-attempt success path
current_try_ == 0, so nokPlannerRetryAttemptwas ever emitted, yet this still pushes a PASSED event. Guard withif (current_try_ > 0). -
ros2_medkit_fault_reporter::LocalFilter::should_forward_passedapplies the same threshold/window logic as FAILED events, even thoughFaultReporter::report_passed's docstring states PASSED bypasses local filtering. With the defaultthreshold=3, a single PASSED following a single FAILED will not propagate to the FaultManager. The pairing pattern used here and atkRobotNotReady,kSignalWaitInputTimeout,kObjectWaitTimeout,kWaitKeyTimeoutonly heals reliably under repeated retry cycles. Worth fixing upstream so PASSED truly bypasses the filter, but in the meantime callers should know healing is best-effort.
| // mid-wait, the timeout condition no longer holds. Without this, | ||
| // any earlier `kSignalWaitInputTimeout` would linger in FaultManager | ||
| // until the next successful wait. | ||
| reportFaultPassed(fault_codes::kSignalWaitInputTimeout); |
There was a problem hiding this comment.
This emits reportFaultPassed(kSignalWaitInputTimeout) on every halt, even when no timeout was previously reported. Combined with the PASSED counter behavior in LocalFilter::should_forward_passed, this accumulates spurious PASSED events that bias subsequent healing decisions.
Track a bool timeout_reported_ set at the FAILED emission site (L883) and only fire the PASSED here if it is set.
| PublishJointStateAction::PublishJointStateAction( | ||
| const std::string & name, const BT::NodeConfiguration & config) | ||
| : BT::SyncActionNode(name, config) | ||
| : BT::SyncActionNode(name, config), FaultReporting(config.blackboard) |
There was a problem hiding this comment.
PublishJointStateAction inherits FaultReporting but never calls reportFault (the same applies to ResetTrajectories in action_nodes_planner.cpp:354).
Either drop the inheritance to keep the surface minimal, or add a short comment justifying the wiring (e.g. reserved for a planned future site). As-is, the empty base looks like an oversight at the call site.
| // SOVD apps[].ros_binding linkage used by the medkit gateway. | ||
| inline void installFaultReporter(BT::Blackboard::Ptr blackboard, rclcpp::Node::SharedPtr node) | ||
| { | ||
| auto reporter = std::make_shared<ros2_medkit_fault_reporter::FaultReporter>( |
There was a problem hiding this comment.
FaultReporter's constructor accepts a third service_name argument (defaults to /fault_manager/report_fault, rooted at /). This helper hard-codes that default and offers no override.
For namespaced test rigs or multi-robot setups where the FaultManager runs under a namespace, callers cannot remap without bypassing this helper. Add an optional parameter:
inline void installFaultReporter(
BT::Blackboard::Ptr blackboard, rclcpp::Node::SharedPtr node,
const std::string & service_name = "/fault_manager/report_fault")
Summary
Wires
ros2_medkit_fault_reporterintomanymove_cpp_treesso every BT action node reports operational faults directly to aFaultManager, instead of routing through/diagnostics. This is the native-instrumentation path the medkit upstream design recommends.What changes
Plumbing
fault_codes.hpp(101 lines): catalogue ofMANYMOVE_*constants grouped by subsystem.fault_reporting.hpp(110 lines):FaultReportingcapability class. Pulls the reporter from the blackboard, same pattern action nodes already use fornode_.installFaultReporterhelper inmain_imports_helper.hpp, wired into all 15 BT executable mains.package.xml+CMakeLists.txt: hard dep onros2_medkit_fault_reporterandros2_medkit_msgs. No CMake toggle (fork-only).Instrumentation (21 action node classes)
reportFaultPassed), retries exhaustedCondition and decorator nodes are intentionally left out: their
FAILUREreturn is normal control flow. Programmer-errorFAILUREreturns (missing inputs, wrong vector sizes) are also skipped.Tests
fake_fault_manager.hpp: matches the existingFakeXxxServerpattern in the test tree.test_fault_reporting.cpp: end-to-end roundtrip covering reporter install, FAILED forwarding, PASSED healing and the silent no-op path when no reporter is present.Docs + CI
docs/FAULT_CODES.md: full catalogue with severity, trigger and pairing semantics.README.md+CHANGELOG.rstentries.ros2_medkit_*debs from rosdistro so the overlay builds.Cleanup
bt_client_*,collision_testand tutorials get unique node names sosource_idis distinguishable per executable.topic_based_ros2_controldeclaration frommanymove_cpp_trees(kept where actually used inmanymove_bringup).Notes
test_action_nodes_gripperfailure (convertFromString registration) is upstream and unrelated to this change.